feat: Implement optimization code paths and functionality for initial release by andrewklatzke · Pull Request #140 · launchdarkly/python-server-sdk-ai

andrewklatzke · 2026-04-17T17:18:56Z

Requirements

I have added test coverage for new or changed functionality
I have followed the repository's pull request submission guidelines
I have validated my changes against all supported platform versions

Related issues

This PR encapsulates all previous changes in the chain of optimization PRs that were broken up into smaller pieces. Consolidating here so that we can have a single commit/release of the package. The PRs were independently reviewed and approved.

Describe the solution you've provided

See:

#116
#117
#119
#122
#127
#128
#130
#135
#139

Note

High Risk
Large new surface (~3k+ lines) that drives LLM workflows, uses LAUNCHDARKLY_API_KEY for REST writes (including auto-commit variations), and persists optimization state—bugs could affect production AI configs or leak mishandled secrets in logs despite redaction filters.

Overview
Replaces the ldai_optimization scaffold with ldai_optimizer (launchdarkly-ai-optimizer on PyPI): the placeholder ApiAgentOptimizationClient is removed in favor of a full OptimizationClient that runs iterative prompt optimization while your code supplies all LLM calls via callbacks.

Entry points: optimize_from_options (random inputs + judges), optimize_from_ground_truth_options (all samples must pass per attempt), and optimize_from_config (remote agent-optimization config + live result persistence). Optional auto_commit creates a new AI Config variation via the REST API when LAUNCHDARKLY_API_KEY is set.

The loop covers judge evaluation (LD flag judges and inline acceptance statements), validation sampling after a pass, LLM-driven variation generation with safeguards (tool/placeholder preservation), optional latency/token phase-2 tuning with gates, token budgets, and status callbacks. LDApiClient wraps agent-optimization and AI Config endpoints with retries; config-driven runs POST/PATCH iteration results for the UI.

Packaging/docs: wheel package path and Makefile lint targets updated, README quick starts and PROVENANCE.md added for attestations.

Reviewed by Cursor Bugbot for commit 69f1804. Bugbot is set up for automated code reviews on this repo.


**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

Implements better handling for params on the initial variation (folds model changes in as overwrites rather than completely replacing) and ensures custom params are persisted unchanged. Additionally makes sure that tools cannot be changed by the optimization process.

**Describe alternatives you've considered**

This is the result of a bug report so there weren't really alternatives considered.

**Additional context**

Initially it was assumed that the optimization process would properly pull params forward (via the LLM) but this doesn't seem to always be the case. In the case of custom params, they aren't fed into the LLM calls since they're user-specified data (not specific to the actual optimization result). We now just pull these through as-is. In the case of tools, the model will be able to optimize the prompt to call a specific tool if multiple are provided, but we don't want to strip any tool information from the final result as it may be necessary for the calls to function.

---

> [!NOTE]
> **Medium Risk**
> Moderate risk because it changes how model parameters and `tools` are carried forward across optimization iterations and what gets auto-committed, which can affect runtime agent behavior if merging/restoration logic is wrong.
>
> **Overview**
> Improves variation-application logic so LLM-generated `current_parameters` are **merged** into existing parameters instead of replacing them, preserving user-specified/custom settings (e.g. `max_tokens`, `response_format`) when the LLM omits them.
>
> Prevents tool drift by always restoring the original `tools` list (and logging when the LLM returns a different one) to avoid silently dropping user tools or leaking internal framework tools.
>
> Captures `model.custom` from the initial LaunchDarkly variation and includes it when auto-committing a winning variation; adds focused test coverage for parameter persistence, tool restoration/warnings, and `model.custom` propagation.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 572a2aa. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).
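The merge-and-restore behavior described in this commit can be sketched roughly as follows. This is a minimal illustration, not the package's code: `apply_variation_parameters` and its arguments are hypothetical names, and the exact shape of the parameter dictionaries is an assumption.

```python
import copy
import logging

logger = logging.getLogger(__name__)


def apply_variation_parameters(existing, generated, original_tools):
    """Hypothetical helper (not the SDK API): merge LLM-generated parameters
    over the existing ones and always restore the original tools list."""
    # Start from the existing parameters so omitted keys (e.g. max_tokens) survive.
    merged = copy.deepcopy(existing)
    # Overwrite only the keys the LLM actually returned; tools are handled below.
    merged.update({k: v for k, v in generated.items() if k != "tools"})
    # Log when the LLM tried to change the tools, then discard its version.
    if "tools" in generated and generated["tools"] != original_tools:
        logger.warning("LLM returned a different tools list; restoring the original")
    if original_tools is not None:
        merged["tools"] = copy.deepcopy(original_tools)
    else:
        merged.pop("tools", None)
    return merged
```

With this shape, a generated payload that omits `max_tokens` and returns an empty `tools` list still yields the user's original `max_tokens` and tools in the result.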

**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

This is intended to demystify some of the results we're receiving from the optimization package - namely:

- Total token counts are now accrued and reported with each result so that we can see if a user crosses the total allowed tokens threshold
- Score results are reported for cost or latency if they're being optimized against as an item in the `score` result so that it can be shown on the UI
- Finally, if quality has already met the required threshold the prompt now contains instructions to optimize only against cost (if cost is being optimized against)

**Describe alternatives you've considered**

This is in some ways a bug fix since this information wasn't clear to the user as to what was causing the failure. Technically additional feature/functionality but likely required to express the required information to make it actionable for the user.

**Additional context**

Cost and latency are only optimized for/include scores if they trigger the keywords that would lead to them being optimized. "Base" implementations without these features being used are unaffected.

---

> [!NOTE]
> **Medium Risk**
> Changes optimization pass/fail logic and persisted result payloads (new gate scores, baseline handling, token-budget semantics), which could affect when runs succeed/fail and what the UI/API receives.
>
> **Overview**
> Improves optimization run reporting by tracking and persisting a single `accumulated_token_usage` total across agent, judge, and variation calls, and including it in result PATCH payloads (extending `generationTokens` to allow `accumulated_total`).
>
> Refactors latency/cost optimization to use explicit baseline values (not `history[0]`), caps history growth (`_trim_history`) for both standard and ground-truth flows, and adds synthetic `_latency_gate`/`_cost_gate` score entries so gate failures are visible in results.
>
> Adjusts run control flow so pass/fail is evaluated before token-limit checks (including GT batches and validation), and updates variation prompting to focus purely on cost reduction when quality is already passing; also relaxes the cost gate tolerance from 20% to 10% improvement and expands tests accordingly.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 365fa94. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).
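As a rough illustration of how a synthetic gate score entry might look, the sketch below compares a measured value against an explicit baseline. The function name, the entry fields, and the reading of "10% improvement" as "the value must be at least 10% below the baseline" are all assumptions made for this example; the commit does not spell out the exact rule or payload shape.

```python
def gate_score(name, value, baseline, required_improvement=0.10):
    """Hypothetical sketch: build a synthetic score entry for a latency or cost gate.

    Assumption: the gate passes when `value` is at least `required_improvement`
    (as a fraction) below the explicit `baseline`.
    """
    threshold = baseline * (1.0 - required_improvement)
    return {
        name: {
            "passed": value <= threshold,
            "value": value,
            "baseline": baseline,
            "threshold": threshold,
        }
    }


# Example: a cost of 1.5 against a baseline of 2.0 clears a 10% improvement gate.
cost_entry = gate_score("_cost_gate", value=1.5, baseline=2.0)
```

Recording the gate as an ordinary score entry means a failed gate shows up next to the judge scores instead of being an invisible reason for rejection.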

**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

Implements cost optimization in the same manner as latency optimization. Searches the acceptance statement for keywords pertaining to token usage/cost (e.g. costs, pricing, bill) and adds instructions to the variation generation to try to optimize for costs. Additionally has the acceptance statement prompt return instructions for the variation generation (ie, cheaper model, etc).

**Describe alternatives you've considered**

This is a feature addition.

**Additional context**

We'll be adding UI options for both latency and cost with adjustable thresholds, but these are still valid once those arrive since a mention of cost/latency means the user is trying to optimize for it.

---

> [!NOTE]
> **Medium Risk**
> Adds new cost-gating logic and changes iteration/batch bookkeeping (baseline tracking, history trimming, token-limit handling), which can affect optimization outcomes and persisted result records. Risk is moderated by extensive new unit tests covering the new gates and edge cases.
>
> **Overview**
> Adds **cost optimization support** alongside existing latency optimization: acceptance statements are scanned for cost keywords, agent calls get per-turn `estimated_cost_usd` (via model pricing when available), and a new `_cost_gate` is applied similarly to `_latency_gate`, with both gates recorded as synthetic judge scores for visibility.
>
> Improves optimization loop correctness and observability by explicitly tracking baselines (duration and cost), trimming `_history` to bounded windows (standard and GT), counting variation-generation tokens into the run total, stamping `accumulated_token_usage` into result payloads, and refining token-limit behavior (treat `0` as unlimited and evaluate pass/fail before halting on budget). Also tightens model ID prefix stripping to avoid breaking Bedrock region-style IDs and updates package metadata naming/description.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 4fc1ecf. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).
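The token-limit semantics mentioned above (a limit of `0` means unlimited, and pass/fail is decided before the budget check) can be sketched like this. `TokenBudget` and `next_step` are hypothetical names for illustration only and do not come from the package.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical accumulator for agent, judge, and variation-generation tokens."""

    limit: int = 0  # assumption mirrored from the commit text: 0 means unlimited
    used: int = 0

    def add(self, tokens: int) -> None:
        self.used += tokens

    def exhausted(self) -> bool:
        # A limit of 0 never halts the run.
        return self.limit > 0 and self.used >= self.limit


def next_step(passed: bool, budget: TokenBudget) -> str:
    """Decide what the loop does after a turn has been scored."""
    # Pass/fail comes first, so a passing turn is not discarded merely
    # because it also crossed the token budget.
    if passed:
        return "success"
    if budget.exhausted():
        return "halt_budget"
    return "retry"
```

Checking the result before the budget avoids reporting a failure for an attempt that actually satisfied the judges on its final turn.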

…ase (#190)

**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

This adds a PROVENANCE.md file and registers it with release-please.

**Describe alternatives you've considered**

No alternatives here; required for security

---

> [!NOTE]
> **Low Risk**
> Low risk: documentation-only addition plus a release configuration tweak to include `PROVENANCE.md` in version bumps; no runtime code changes.
>
> **Overview**
> Adds a new `packages/optimization/PROVENANCE.md` documenting how to verify published wheel provenance using GitHub artifact attestations.
>
> Updates `release-please-config.json` so `packages/optimization` treats `PROVENANCE.md` as an `extra-file`, ensuring the doc’s embedded version snippet is kept in sync during releases.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 32dc4d0. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).


**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

We added first class support for some fields on the UI:

- Latency optimization
- Token optimization
- Auto commit toggle

This PR pulls them into the SDK. token/latency optimization are using their previous paths for now in this PR. Rather than the regex approach we just use the flags from the API now. The latency/cost optimization paths will be updated in a subsequent PR.

**Describe alternatives you've considered**

The initial implementation of these two code paths for optimizations were kind of hacky to begin with (just using a dictionary to look up words that might mean they want to do it). This was the intended solution.

---

> [!NOTE]
> **Medium Risk**
> Changes when latency/cost gates and judge templates apply (explicit flags vs inferred text) and alters config-judge loading and auto-commit gating, which can shift optimization outcomes for existing runs.
>
> **Overview**
> Wires **LaunchDarkly agent optimization API** fields into the Python SDK: `latencyOptimization`, `tokenOptimization`, and `autoCommit` on remote configs, plus optional **`variation_key`** on `OptimizationOptions` / `GroundTruthOptimizationOptions` to start from a specific AI config variation (REST fetch; requires API key and `project_key`).
>
> **Latency and token behavior** no longer infer goals from acceptance-statement keyword regexes. Gates, judge prompt augmentations, variation prompts, and model-pricing warnings now key off **`latency_optimization`** and **`token_optimization`** booleans (from options or API). When unset/false, those paths stay off.
>
> **Config judges** resolve via raw flag **`variation()`** and local `{{key}}` interpolation (including `message_history` / `response_to_evaluate`) instead of `LDAIClient.judge_config`. System-only judge templates get an auto-built user turn.
>
> **`optimize_from_config`** maps the new API fields into built options; **auto-commit** runs only when both the fetched config’s `autoCommit` and caller options allow it. Tests drop regex helpers and cover the new flags, judge path, and `variation_key` validation.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 509240f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).
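Local `{{key}}` interpolation of a judge template could look something like the sketch below. The `interpolate` helper is a hypothetical stand-in, and leaving unknown placeholders untouched is an assumption; the package may treat them differently.

```python
import re

# Matches {{key}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def interpolate(template: str, variables: dict) -> str:
    """Hypothetical sketch: replace {{key}} placeholders with values from `variables`.

    Assumption: placeholders without a matching key are left as written.
    """

    def replace(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    # A function replacement avoids backslash-escape handling in the values.
    return _PLACEHOLDER.sub(replace, template)


rendered = interpolate(
    "History: {{message_history}}\nAnswer: {{ response_to_evaluate }}",
    {"message_history": "user: hi", "response_to_evaluate": "hello"},
)
```

Using a replacement function rather than a replacement string matters here because model responses may contain backslashes or group-like sequences that `re.sub` would otherwise interpret.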

**Requirements**

- [x] I have added test coverage for new or changed functionality
- [x] I have followed the repository's [pull request submission guidelines](../blob/main/CONTRIBUTING.md#submitting-pull-requests)
- [x] I have validated my changes against all supported platform versions

**Describe the solution you've provided**

Moves the cost and latency optimization process to happen as a post-process pass rather than attempting to optimize for everything in each loop. This helps reduce the amount of noise the LLM is dealing with in a single loop. Flow is now optimize for quality -> validate with additional samples -> optimize for meta (latency, cost).

**Describe alternatives you've considered**

The ultimate goal here is to move to distinct scorers/criteria that can be ranked. For now, this is a better solution than the all-in-one passes we were doing previously which could regress.

---

> [!NOTE]
> **Medium Risk**
> Changes when optimizations pass/fail, which model/parameters are committed, and callback timing—behavioral regressions are possible despite extensive test updates.
>
> **Overview**
> **Cost and latency are no longer mixed into the main optimization loop.** Phase 1 only chases judge/validation quality; duration and cost gates are removed from standard turns, validation, and ground-truth samples. When latency or token optimization is enabled and Phase 1 succeeds, **`_run_cost_latency_phase`** runs with instructions frozen, reuses the winner’s input/variables, evaluates each distinct `model_choices` entry, applies latency/cost gates there, and picks the best passing candidate via normalized duration + cost vs baseline.
>
> **Prompting and variation generation split by phase:** `build_new_variation_prompt` no longer takes cost/latency flags; Phase 2 uses new **`build_token_latency_variation_prompt`** (content lock, model/param-only changes). LLM instruction edits in Phase 2 are reverted if they drift from the frozen winner. Judge prompts inject latency/cost guidance only while **`_in_cost_latency_phase`**.
>
> **Run lifecycle and API surface:** **`on_passing_result`** fires once with the true final context (Phase 2 winner or Phase 1 fallback); **`_handle_success`** can suppress that callback during intermediate success. Every agent turn adds a **`_meta`** score entry for raw latency/cost telemetry. **`auto_commit`** now persists **`parameters`** on the created variation. Tests were updated so Phase 1 success no longer depends on duration gates.
>
> Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 4eb0bb0. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).
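One way to read "picks the best passing candidate via normalized duration + cost vs baseline" is sketched below. The `Candidate` type, the equal weighting of the two ratios, and the requirement that baselines are positive are assumptions for illustration; the actual scoring inside `_run_cost_latency_phase` may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one evaluated `model_choices` entry."""

    model: str
    duration_ms: float
    cost_usd: float
    passed_gates: bool


def combined_score(c: Candidate, baseline_duration_ms: float, baseline_cost_usd: float) -> float:
    # Each term is a ratio against the baseline, so 1.0 means "same as baseline".
    # Assumption: both baselines are positive and the terms are weighted equally.
    return c.duration_ms / baseline_duration_ms + c.cost_usd / baseline_cost_usd


def pick_best(
    candidates: Sequence[Candidate],
    baseline_duration_ms: float,
    baseline_cost_usd: float,
) -> Optional[Candidate]:
    """Return the passing candidate with the lowest combined score, or None."""
    passing = [c for c in candidates if c.passed_gates]
    if not passing:
        return None  # the caller would fall back to the Phase 1 winner
    return min(
        passing,
        key=lambda c: combined_score(c, baseline_duration_ms, baseline_cost_usd),
    )
```

Normalizing against the baseline puts milliseconds and dollars on a comparable scale before they are summed, and returning `None` corresponds to the Phase 1 fallback described above.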

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.


cursor · 2026-06-18T17:33:02Z

+                if iteration >= self._options.max_attempts:
+                    return self._handle_failure(optimize_context, iteration)
+                self._record_baseline(last_ctx)
+                self._history.append(last_ctx)


Wrong baseline after validation fail

Medium Severity

When the main turn passes judges but validation fails, _record_baseline runs on the failed validation context only, and the successful primary turn is never baselined. Phase 2 cost/latency gates can then compare against the wrong run (different input/latency), skewing pass/fail for later attempts.

Additional Locations (1)

packages/optimization/src/ldai_optimizer/client.py#L3104-L3108


cursor · 2026-06-18T17:33:02Z

+    # Try direct parse first
+    try:
+        return json.loads(response_str)
+    except json.JSONDecodeError:


Non-dict JSON crashes validation

Medium Severity

extract_json_from_response returns whatever json.loads produces on the first successful parse, without requiring a JSON object. If the model returns a JSON string, number, or array, validate_variation_response raises TypeError on membership checks instead of a controlled validation error, aborting variation generation.
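A possible shape for the fix is to accept only JSON objects and raise a controlled error otherwise. The sketch below is not the package's `extract_json_from_response`; `extract_json_object` and `VariationResponseError` are hypothetical names, and the brace-span fallback is a simplification of whatever extraction the real function performs.

```python
import json


class VariationResponseError(ValueError):
    """Hypothetical controlled error for unusable model responses."""


def extract_json_object(response_str: str) -> dict:
    """Return the first JSON object found in `response_str`.

    Unlike a bare json.loads, a top-level string, number, or array is rejected
    rather than returned to the caller.
    """
    candidates = [response_str]
    # Fallback: the span from the first "{" to the last "}" (handles prose around JSON).
    start, end = response_str.find("{"), response_str.rfind("}")
    if start != -1 and end > start:
        candidates.append(response_str[start : end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise VariationResponseError("expected a JSON object in the model response")
```

Downstream membership checks such as `"key" in result` are then always run against a dict, so a malformed response surfaces as a validation error instead of a `TypeError`.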


andrewklatzke added 29 commits March 25, 2026 17:09

feat: implements optimize method in SDK, code moved

9859d08

feat: implementation of agent optimization + tests

1712e4f

feat: implement ability to use completions or agents for judge calls

ea596a7

feat: all logs -> debug

2fd55e2

fix: lints + structured output tool rename

8481690

fix: lint + missed variable rename

f8e5509

fix: sort imports

c032aaf

fix: lint

aee6aa7

chore: break up long lines, add spaces where necessary

59c7ac7

chore: break up another long line

59f03f2

chore: fix on_turn path

e2ff561

chore: move prompts to own file, better debug info

af2dd03

chore: update tests, fix cursor feedback

ea43575

feat: implements LD API client, optimize_from_config path

2fecd54

feat: partially implement optimize_from_config

d3e1f96

feat: ground truth optimization path

44c8c59

feat: prevent overfitting via prompt changes and post-processing

8f9f1e2

chore: remove some dead code

a17fd6e

chore: remove provided_tool_handlers code

67fdbf1

fix: adjust iteration logic so validation doesn't consume them

3042984

feat: implement latency & token tracking for optimizations

288336e

feat: add optimization for duration

5d76276

feat: add auto-commit option

4cb8859

chore: add tests

ba369a2

chore: various fixes, improvements for optimization package

149aa76

feat: add shared dataclass for calls so they can be handled by same handler

31c8385

chore: improve call config, context so they're passable as a single type, remove required context_choices argument and default to anon

55674ae

fix: success path + add test, cursor feedback

8f3468f

feat: dx improvements for optimization package

7074cfa

andrewklatzke requested a review from jsonbailey April 17, 2026 17:18

andrewklatzke added 29 commits May 6, 2026 14:17

feat: adds ability to optimize for cost

94de596

fix: remove unnecessary token path

dc82818

feat: adds reporting for cost and latency optimization failures

365fa94

fix: don't only evaluate final input in GT results

9bedf9e

fix: don't only evaluate final input in GT results

53f455f

fix: only strip known provider prefixes

66bc1f0

fix: address cursor feedback

f2f0894

chore: adjust package name

9c1d8d7

fix: pull model configs if available in options path

4fc1ecf

chore: (optimization) add provenance file + register with release please

32dc4d0

Merge branch 'main' into aklatzke/AIC-2263/sdk-dx-improvements

99dd7b8

chore: lint

849d41f

Merge branch 'aklatzke/AIC-2263/sdk-dx-improvements' of github.com:launchdarkly/python-server-sdk-ai into aklatzke/AIC-2263/sdk-dx-improvements

c51e235

chore: isort

f481ed3

chore: line lengths

27b248e

chore: fix agent init typing

4545203

implement api fields

b506fef

circumvents judge_config calls to make judge evaluations in optimization

806b564

cursor feedback

509240f

changes cost & latency optimization to post-process

af4ec5d

cursor feedback

de5f24f

more cursor feedback

aa0a77f

fix: ensure cost data is persisted

4eb0bb0

cursor Bot reviewed Jun 18, 2026

andrewklatzke wants to merge 73 commits into main from aklatzke/AIC-2263/sdk-dx-improvements


3 participants
